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PREFACE 


The  work  described  in  this  report  was  authorized  under 
Project  No.  1C162622A553,  CB  Defense/General  Investigation.  This 
work  was  started  in  June  1990  and  completed  in  August  1990. 

The  use  of  trade  names  or  manufacturers ’  names  in  this 
report  does  not  constitute  an  official  endorsement  of  any  commer¬ 
cial  products.  This  report  may  not  be  cited  for  purposes  of 
advertisement . 

Reproduction  of  this  document  in  whole  or  in  part  is 
prohibited  except  with  permission  of  the  Commander,  U.S.  Army 
Chemical  Research,  Development  and  Engineering  Center,  ATTN: 
SMCCR-SPS-T,  Aberdeen  Proving  Ground,  MD  21010-5423.  However, 
the  Defense  Technical  Information  Center  and  the  National  Techni¬ 
cal  Information  Service  are  authorized  to  reproduce  the  document 
for  U.S.  Government  purposes. 

This  report  has  been  approved  for  release  to  the 

public. 
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USE  OF  STATISTICS  IN  COMPUTATIONAL  CHEMISTRY 


1.  INTRODUCTION 

With  increased  concern  for  the  environment  and  the 
tightenincr  of  regulatory  requirements,  it  is  becoming  increas¬ 
ingly  more  difficult  and  more  expensive  to  conduct  exploratory 
experiments  in  the  chemical  laboratory.  With  the  significant 
advancements  that  have  been  made  in  recent  years  in  the  area  of 
computational  chemistry,  it  is  now  possible  to  calculate  many 
physical  and  chemical  properties  using  one  of  the  commonly 
available  computational  methods.  However,  at  this  time,  it  is 
not  possible  to  estimate  how  close  th-  calculated  property  of  a 
new  compound  is  tr  the  true  value. 

In  a  previous  report,  Birenzvige  and  co-workers1  use 
statistical  methods  to  compare  the  ability  of  the  three  most 
commonly  used  computational  methods  (MNDO,  PM3 ,  and  AMI)  to 
predict  the  heat  of  formation,  dipole  moment,  polarizability  and 
ionization  potential  of  12  vanillic-type  molecules.  The  result 
of  that  study  shows  that  by  using  proper  statistical  tools,  we 
can  predict  the  precision  and  accuracy  of  the  computational 
method.  In  addition,  Birenzvige  and  co-workers  show  that  not  one 
method  is  suited  to  calculate  all  the  physical  properties. 

Stewart2  published  an  up-to-date  summary  of  computated 
heat  of  formation  (775  molecules) ,  geometries  (bond  length  and 
bond  angle  -  209  molecules,  including  174  bond  angles  and  372 
bond  lengths),  dipole  moment  (125  molecules),  and  ionization 
potential  (256  molecules) .  In  this  report,  we  extend  the  methods 
used  before  to  compare  how  well  the  three  computational  methods 
can  predict  the  various  physical/chemical  properties. 


2.  PRINCIPLES  OF  STATISTICAL  METHODS  USED 

As  previously  stated,  the  purpose  of  this  study  was  to 
determine  the  ability  of  each  of  the  three  semi-empirical  methods 


birenzvige,  A.,  Sturdivan,  L. ,  Famini,  G.R. ,  Krishnan,  P.N.,  and 
Morris,  R.E.,  Predicting  Polymer  Properties  bv  Computational 

. Zi _ A  ■Comparison  of  Semi-Empirical  Methods.  CRDEC- 

TR-361,  U.S.  Army  Chemical  Research,  Development  and  Engineer¬ 
ing  Center,  Aberdeen  Proving  Ground,  MD,  September  1992, 
UNCLASSIFIED  Report  (A256  856) . 

zStewart,  James  J.P.,  "Optimization  of  Parameters  for  Semi- 
empirical  Methods,  II.  Applications,”  J.  Comput.  Chem.  Vol  10, 
p  221  (1989) . 
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to  calculate  the  physical /chemical  property  in  question.  To 
enable  us  to  determine  the  accuracy  of  the  calculation  (i.e.,  the 
standard  deviation  of  the  error)  in  a  statistically  meaningful 
way,  we  need  to  show  that  the  calculation  errors  are  symmetrical¬ 
ly  distributed  about  the  real  (experimental)  values.  This  is 
conventionally  done  by  showing  that  the  data  follow  the  normal 
distribution  function.  One  way  of  showing  that  a  data  set  is 
normally  distributed  is  to  order  it  in  an  ascending  older  and 
plot  the  data  on  a  normal  distribution  graph  paper.  For  example, 
suppose  we  take  the  weight  of  nine  people  (n*9)  selected  at 
random.  First,  we  sort  them  in  ascending  order;  then  we  scale 
the  linear  Y  axis  so  that  all  weights  will  fit.  Finally,  we  plot 
the  cumulative  fraction  on  the  probability  axis  versus  the  weight 
on  the  Y  axis,  letting  the  denominator  of  the  fraction  equal  n+1 
(for  symmetry).  Thus,  the  lightest  weight  would  be  plotted 
versus  0.1  (1/n+l),  the  next  lightest  versus  0.2,  and  so  on  until 
the  heaviest  would  be  plotted  against  0.9  (n/n+1).  If  the 
weights  were  normally  distributed,  the  resulting  nine  points 
would  fall  on  a  straight  line.  Alternatively,  we  can  calculate 
the  normal  score  that  is  the  expected  value  of  the  normal  order 
statistic  of  an  ordered  sample  of  size  n.  In  the  statistical 
package  MINITAB(t-),  the  normal  score  is  abbreviated  N-score. 
Plotting  the  N-score  against  normally  distributed  data  will 
result  in  the  points  falling  about  a  straight  line. 

Calculating  the  N-score  requires  numerical  solution  of 
integral  equations.  Calculation  of  N-score  is  available  in  some 
statistical  packages  on  minicomputers  but  is  not  available  in 
commonly  used  software  packages  for  microcomputers.  To  enable  us 
to  perform  the  analysis  on  a  desk- top  microcomputer,  we  need  to 
find  a  distribution  function  that  will  closely  resemble  the 
normal  distribution  but  will  be  easier  to  compute.  The  logistic 
distribution  is  such  a  distribution.  Its  straight  line  trans¬ 
form,  which  we  will  call  the  L-score,  is  obtainable  in  closed 
form  and  is  simple  to  calculate. 

Figure  1  shows  a  comparison  of  L-score  and  N-score.  It 
was  produced  as  follows;  First,  we  calculated  the  N-score  of  an 
ordered  set  of  numbers  from  1  to  1000  using  minitab<tB)  on  the  VAX 
minicomputer.  We  then  downloaded  the  data  into  a  spread  sheet  on 
a  desk-top  PC  and  calculated  the  L-score  according  to  the  follow¬ 
ing  equation; 


L-scoreU) 


In 


i 

n-i+l 


(1) 


where  i  is  the  order  of  the  item  in  the  list,  and  n  is  the  total 
number  of  items.  The  dashed  line  is  the  plot  of  L-score  versus 
N-score.  The  solid  line  is  a  least-squares-fitted  straight  line 
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through  the  data.  As  can  be  seen,  the  two  lines  coincide  except 
at  the  ends  where  the  slightly  heavier  tails  of  the  logistic 
distribution  cause  a  bit  of  curvature  away  from  the  straight 
line.  The  correlation  coefficient  (R-squared)  of  the  two  mea¬ 
sures  is  0.994.  When  real  data  are  plotted  versus  N-score  and, 
separately,  versus  L-score,  there  is  no  visible  difference  in  the 
"straightness,"  or  lack  of  it,  between  the  two  plots. 

For  each  molecule,  the  calculation  errors  (as  reported 
by  Stewart)2  were  plotted  against  L-score,  and  the  correlation 
coefficient  of  the  least  squares  regression  line  was  determined. 
The  plot  was  examined  visually  to  determine  any  outliers  and 
whether  the  fit  will  improve  in  a  limited  region.  The  average 
and  standard  deviation  of  the  calculation  errors  were  calculated 
in  the  region  of  symmetry  as  was  the  R-square  for  the  least 
square  regression  line.  To  choose  the  best  method  for 
calculating  the  physical  property  in  the  region  at  a  95% 
confidence  level,  the  following  procedures  should  be  followed:3 

•  For  each  of  the  three  different  methods,  plot  the 
difference  between  the  calculated  and  experimental  values  of  the 
property  approximated  versus  its  L-score.  (Alternatively,  plot 
the  difference  of  the  transformed  data  versus  the  L-score.) 

•  Determine  a  region  (magnitude  of  calculated  and 
experimental  values)  where  the  L-score  plot  forms  a  straight 
line,  indicating  a  symmetrical  "normal”  distribution  of  errors. 
Transformed  and  untransformed  data  may  have  to  be  used  for 
different  regions  (e.g.,  the  data  might  be  normally  distributed 
in  one  region  and  lognormal  in  another) . 

•  Calculate  the  R-square  between  the  difference  and 
the  L-score  for  the  appropriate  region. 

•  Calculate  the  mean  and  standard  deviation  of  the 
approximation  error  in  the  appropriate  region. 

•  A  method  that  has  an  R-square  of  0.94  or  larger  is 
well  approximated  by  the  normal  distribution.  Among  those  that 
satisfy  this  criterion,  choose  the  method  that  has  the  smallest 
standard  deviation,  unless  that  standard  deviation  is  >2.28  times 
the  size  of  the  standard  deviation  of  another  method  whose  R- 
square  is  <0.94.  In  the  latter  case,  choose  the  method  with  the 
smaller  standard  deviation  regardless  of  the  value  of  R-square. 


^ood,  A.M. ,  Graybill,  F.A.,  and  Boes,  D.C.,  Introduction  to  the 
Theory  of  Statistics.  3rd  ed.,  McGraw-Hill,  New  York,  NY,  1974. 


•  Approximately  95%  of  the  time,  the  experimental 
values  will  be  in  the  range  < (calculated  value  -  bias)  ±  2 a>. 


3.  RESULTS  AND  DISCUSSION 

3 . 1  Heat  of  Formation. 

Figures  2-4  are  plots  of  the  differences  between  the 
calculated  experimental  heat  of  formation  for  PM3,  MNDO,  and  AMI, 
respectively.  In  each  case,  the  straight  line  represents  the 
least  squares  regression  line  of  the  calculation  errors  versus  L- 
score.  As  clearly  seen  only  in  the  case  of  PM3,  the  R-square  is 
larger  than  0.94.  The  PM3  calculation  method  has  an  average  bias 
of  0.44  Kcal/mole  (i.e.,  the  calculation  errors  are  normally  dis¬ 
tributed  about  an  error  of  0.44  Kcal/mole)  and  a  standard  devia¬ 
tion  of  12.3  Kcal/mole.  Thus,  the  true  heat  of  formation  for  a 
compound  with  no  experimental  value  will  be  as  follows: 


True  Value  =  (Calculated  Value  -0.44)  ±12. 3x2  (2) 

with  95%  confidence. 


Table  1  lists  the  seven  compounds,  which  have  the 
largest  deviation  (these  compounds  are  marked  in  Figure  2) .  If 
these  compounds  are  eliminated  from  the  analysis,  the  precision 
(bias)  and  accuracy  (standard  deviation)  of  the  method  improve 
somewhat  (0.37  Kcal/mole  and  11.1  Kcal/mole,  respectively). 

3.2  laniaaSian  P9.tqflti.ai . 

Figures  5-7  show  that  the  calculated  ionization  poten¬ 
tial  for  all  three  methods  are  normally  distributed  about  the 
experimental  values.  MNDO  provides  slightly  higher  accuracy 
(standard  deviations  of  0.7  ev  versus  0.73  ev  for  PM3  and  AMI, 
respectively)  and-  should  be  the  preferred  method  even  though  it 
has  higher  bias  (0.68  ev  versus  0.12  ev  and  0.40  ev  for  PM3  and 
AMI,  respectively) . 

3 . 3  Dipole  Moment. 

The  correlation  coefficient  of  the  calculation  errors 
of  the  dipole  moment  versus  L-score  for  PM3  (Figure  8)  is  only 
0.92;  thus,  this  method  is  not  suitable  for  calculating  the 
dipole  moment.  The  correlation  coefficient  for  both  MNDO 
(Figure  9)  and  AMI  (Figure  10)  is  0.96.  The  standard  deviation 
for  AMI  is  smaller  (0.49  deby)  than  the  standard  deviation  for 
MNDO  (0.63  deby).  Thus,  AMI  should  be  the  preferred  method  for 
calculating  dipole  moment.  The  table  under  each  figure 
(Tables  2-4)  lists  the  compounds  with  the  largest  deviation  from 
the  regression  line. 
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3.4 


Bond  Length. 


As  can  be  seen  (Figures  11-13) ,  none  of  the  three 
methods  evalua'ted  in  this  study  produces  a  data  base  of  bond 
length,  which  is  normally  distributed  about  the  true  (experimen¬ 
tal)  values.  In  each  case,  there  were  few  clear  outliers,  which 
are  listed  in  Tables  5-7  (nine  compounds  for  PM3 ,  five  compounds 
for  MNDO,  and  seven  compounds  for  AMI) .  When  these  outliers 
were  removed  from  considerations,  the  correlation  coefficient 
increased  considerably  (Figures  14-16) .  However,  in  the  case  of 
PM3  and  MNDO,  the  correlation  coefficient  still  remained  below 
0.94.  In  the  case  of  AMI  when  points  1-5  and  point  8  were 
removed  (these  points  show  the  largest  deviation  from  the  regres¬ 
sion  line),  the  correlation  coefficient  increased  to  0.94. 

3 . 5  Bond  Angle. 

Of  the  three  methods  evaluated  (Figures  17-19) ,  it 
appears  that  only  AMI  shows  a  tendency  for  symmetry  (Figure  19) 
even  though  there  are  few  clear  outliers.  After  removing  the 
most  extreme  outliers  from  consideration  in  PM3  and  MNDO 
(Tables  8-9) ,  the  correlation  coefficient  in  PM3  increases  to 
0.92  (Figure  20)  -  not  enough  to  show  that  the  data  are  normally 
distributed.  The  correlation  coefficient  for  MNDO  remains  very 
low  (Figure  21).  For  AMI,  once  the  six  compounds  listed  in 
Table  10  are  removed  from  consideration,  the  correlation 
coefficient  between  the  computation  errors  and  their 
corresponding  L-score  exceeds  the  required  0.94  (Figure  22). 


4.  CONCLUSIONS  AND  RECOMMENDATIONS 

By  using  appropriate  statistical  tools,  we  show  that 
semi-empirical  computational  methods  can  be  used  as  screening 
methods  to  predict  different  physical  or  chemical  compounds  with 
a  high  degree  of  confidence.  Furthermore,  we  show  that  not  one 
of  the  commonly  used  computational  methods  is  adequate  for 
computing  all  of  the  properties. 

For  each  of  the  properties  studied,  the  computation 
error  can  be  divided  by  a  systematic  error  or  bias  (the  precision 
of  the  calculation)  and  a  random  error  (the  accuracy  of  the 
calculation  or  standard  deviation) .  Table  11  lists  the  different 
properties,  the  recommended  computational  method,  and  its  bias 
and  standard  deviation.  When  calculating  the  property  of  an 
unknown  compound,  the  true  value  can  be  evaluated  as  follows 
(with  95%  confidence) : 

True  Value  =  (Computed  Value  -  Bias)  +  2o  (3) 
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CALCULATION  ERRORS 


Figure  2.  Test  for  Normal  Distribution  of  Calculation  Errors  - 
Heat  of  Formation,  PM3 


Table  1.  Compound  with  Largest.  Deviation  of  Calculated  Error 
of  Heat  of  Formation  Using  the  PM3  Method 
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Figure  3.  Test  for  Normal  Distribution  of  Calculation  Errors  - 
Heat  of  Formation,  MNDO 


Figure  4.  Test  for  Normal  Distribution  of  Calculation  Errors  - 
Heat  of  Formation,  AMI 
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Figure  6.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Ionization  Potential,  PM3 
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IONIZATION  POTENTIAL 
AM! 


Figure  7 


Test  for  Normal  Distribution  of  Calculation  Errors 
-  Ionization  Potential,  AMI 
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CALCULATION  ERRORS 


DIPOLE  MOMENT 
PM3 
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Figure  8.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Dipole  Moment,  PM3 


Table  2.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Dipole  Moment  Using  the  PM3  Method 


CALCULATION  ERRORS 


DIPOLE  MOMENT 
MNOO 


Figure  9.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Dipole  Moment,  MNDO 


Table  3. 


Compound  with  Largest  Deviation  of  Calculated  Error 
of  Dipole  Moment  Using  the  MNDO  Method 


T 


T 


CHEMICAL 


EXPERIMENTAL 
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Figure  11.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Length,  PM3 


Table  5.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Bond  Length  Using  the  PM3  Method 


CALCULATION 


Bond  Length  -  MNOO 


Figure  12.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Length,  MNDO 


Table  6.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Bond  Length  Using  the  MNDO  Method 


• 

CHEMICAL 

BOND 

EXPERIME 

NTAL 

CALC. 

ERRORS 

1 

n2os 

N-N 

2.08 

-0.706 

2 

H,F,  ( HF  dimmer ) 

H'-F 

1.87 

1.093 

3 

C7H7NOj 

( salicylaldoxime ) 

N( 14  )- 
H(  17 ) 

1.834 

0.94 

4 

H40,  ( HjO  dimmer) 

0-0 

3.0 

0.905 

1  5 

c7h7no2 

( salicylaldoxime ) 

0(10)- 
N(  14 ) 

2.626 

0.781 

I  6 

F7I 

I -F  (ax) 

1.76 

0.731 
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Figure  13.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Length,  AMI 


Table  7.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Bond  Length  Using  the  AMI  Method 


EXPERIME 

CALC. 

NTAL 

ERRORS 

2.08 

-0.728 

3.00 

-0.383 

1.76 

0.866 

1.87 

0.417 

1.46 

0.328 

1.834 

0.302 

2.684 

0.298 

1.42 

0.287 

Figure  14.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Length,  PM3  -  Extremes  Excluded 


Figure  15.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Length,  MNDO  -  Extremes  Excluded 
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Figure  16.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Length,  AMI  -  Extremes  Excluded 
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Figure  17.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Angle,  PM3 


Table  8.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Bond  Angle  Using  the  PM3  Method 


# 

CHEMICAL 

BOND 

EXPERIME 

NTAL 

CALC.  1 
ERRORS  8 

1 

H4N2 

HN-NH 

90.0 

90.3  | 

2 

_ H*°2  - 

HO-OH 

119.8 

60.2  I 

3 

_ 

H'FH 

108.0 

39.0  | 

4 

F3Br2 

F'BrF 

86.2 

33.8  I 

5 

F,C1 

FC1F ' 

00 

vj 

• 

Ol 

32.5  I 

-1  -J  -I  0  1 

l  -KOK 


«  4  ( 


Figure  18.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Angle,  MNDO 


Table  9.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Bond  Angle  Using  the  MNDO  Method 


Figure  19.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Angle,  AMI 


Table  10.  Compound  with  Largest  Deviation  of  Calculated  Error 
of  Bond  Angle  Using  the  AMI  Method 


BOND  ANGLE  -P M3 
EXTREMES  EXCLUDED  (5  points) 


-5  -4  -J  •!  -1  0 

i-a 
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Figure  22.  Test  for  Normal  Distribution  of  Calculation  Errors 
-  Bond  Angle,  AMI  -  Extremes  Excluded 
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Table  11.  Recommended  Computational  Methods  for  the 
Different  Physical /Chemical  Properties 


COMPUTED 

PROPERTY 

RECOMMENDED 

METHOD 

BIAS 

0 

HEAT  OF 
FORMATION 

PM3 

0.44Kcal/mole 

12.3Kcal/mole  | 

IONIZATION 

POTENTIAL 

MNDO 

0.68  ev 

0.70  ev 

DIPOLE  MOMENT 

AMI 

0.49  deby 

BOND  LENGTH 

AMI 

0.08  A 

0.064  A 

BOND  ANGLE 

AMI 

1.24° 

4.35° 
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